1. INTRODUCTION

The EUREKA system (1) is a free text information retrieval
system in operation on a PDP-11/40 computer system at the
University  of  Illinois.  It  is  used  extensively  as  an  experimental 
tool  to  determine  the  desirability  of  various  features  in  such  a 
system. 

Any  information  retrieval  system,  whether  it  uses  a 
controlled  vocabulary  index  or  free  text  searching,  has  the 
problem of matching the terms and language of the searcher with
those used in the controlled index or in the documents themselves
(2). If this problem is not solved, then it is to be expected
that  search  recall  ratios  will  suffer  because  the  searcher  is  not 
presenting  the  correct  terms  in  his  searches. 

There  is  much  to  commend  a  free  text  information  retrieval 
system such as EUREKA when it is used in non-delegated search mode
by practitioners in the field. The natural language of the
documents is likely to match the language of the searcher more
closely than any controlled vocabulary will. Since most people
have slightly different vocabularies, searchers will often not use
the exact terms used in the documents. Unfortunately, the
searcher  in  such  a  system  has  a  very  large  number  of  words  to 
choose  from  and  must  specify  all  of  the  possible  terms  to  assure  a 
high  level  of  recall. 


When conducting a search in the EUREKA system, the searcher
is continually creating synonym tables in the form of boolean
search expressions that he develops. He may not think of all
possible ways in which a particular topic may be expressed, and
even if he does discover most of the terms used in the documents,
the time expended may be considerable and will reduce the
efficiency of his search. Moreover, this effort is lost once the
search is completed and the identical process may be repeated many
times.

To  help  overcome  these  problems,  the  searcher  in  a  free  text 
system  needs  a  thesaurus  or  similar  aid  to  control  synonyms  and  to 
group  related  words  together.  Such  a  thesaurus  would  normally 
consist largely of tables of synonyms or near synonyms. For
example the thesaurus should remind the user that 'factory' may
also appear as 'industrial plant' or 'warehouse'. The thesaurus
will also perform the important task of preserving portions of
search strategies over an extended period of time for all users.
It is desirable that the user be able to manipulate and examine
such a thesaurus on-line. Moreover, the searcher should be able
to  have  all  synonyms  for  an  input  term  substituted  automatically 
in  his  search. 

This  thesis  describes  a  thesaurus  facility  which  has  been 
introduced  with  these  features  to  ease  the  burden  a  searcher  must 
carry during his search. Chapter 2 describes some specific
reasons for recall failure and the thesaurus facilities introduced
to overcome these. In Chapter 3 the explicit thesaurus commands
are described and Chapter 4 is a description of the actual
implementation of these features. Chapter 5 is a summary of the
facilities available.


2. BASIS FOR THESAURUS FACILITIES


2.1 Reasons for Recall Failure

When a user is conducting a search he is greatly interested
in the recall ratio and precision ratio of his search (1). The
recall ratio is the proportion of the relevant material available
in the data base that is found by a search, and the precision ratio
is the proportion of the material found that is judged by the user
to be relevant. As described in the following sections, a thesaurus
feature is intended primarily to increase the recall ratio of a
search, but also has an impact on the precision ratio.


Salton (4) has found that many recall failures in free text
systems are due to three different problems with the language of
searchers and documents. The first of these is due to synonyms,
where words with the same meaning are used interchangeably. It
may be that different documents use different terms for the same
concept, or the searcher uses a different term from that found in
the documents. Examples of such cases are 'dark' and 'night' or
'fright' and 'scare'. Another form of synonym is the conceptual
group containing words which are related but are not identical in
meaning, such as the terms 'brain', 'nervous system' and 'spinal
column'.


The second problem is caused by variant spellings of the same
word. Such variations in spelling are due to different tenses or
other inflected forms of the word. For example we may have
'factory' and 'factories' or we may have 'control', 'controller',
'controlling', 'controlled' and 'controls'. In these cases a
searcher would find it difficult to think of all word endings for
inclusion in a search.

The third source of difficulty is the occurrence of different
grammatical constructions. For example, the concept of 'birth
control' may alternatively be expressed as 'pregnancy prevention',
'prevention of pregnancy', 'control of the birth rate' etc. The
searcher, without the use of a controlled index where one term is
adopted for all such alternatives, would not usually be able to
think of all of these phrases in his search.

2.2 Increasing the Recall Ratio

The three problems outlined in the previous section have been
addressed by introducing a thesaurus facility to the EUREKA system
to improve the recall ratio in that system.

One method of overcoming these problems is to base the system
on a precontrolled vocabulary. Rather than using the words of the
document text in the indexing, these words are converted to the
word stem or to predetermined concept classes. If word stems are
used as the basis for this approach, the problem of synonyms must
still be solved. When concept classes are used the problems
involved with a controlled vocabulary are reintroduced and all
terms used in the natural language searches must be converted to
the appropriate concept classes. In addition the EUREKA system
has an index based on all words in the text of documents and it
was considered a major task to convert it to a precontrolled
index.

The approach taken was to introduce a thesaurus which would
be placed between the natural language of the searcher and the
free text index. All the terms in a concept class are stored as
one entry in the thesaurus. The statements entered by the
searcher are matched against all entries in the thesaurus, and if
a match occurs, the term in the search statement is replaced by
the thesaurus entry. In this way the search statements are
expanded before being passed on to the index. The facilities
offered  by  the  thesaurus  are  limited  to  constructs  which  can  be 
placed  in  the  thesaurus  entries.  The  following  sections  describe 
the  facilities  available  to  overcome  each  of  the  three  problems 
outlined  in  the  last  section  as  causing  recall  failure. 
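
The expansion step described above can be illustrated with a minimal sketch. It is not the EUREKA implementation: the thesaurus is assumed here to be a plain list of concept classes, each a list of terms, and the search statement is assumed to be a simple list of alternative (OR) terms. The function name `expand_terms` is hypothetical.

```python
def expand_terms(terms, thesaurus):
    """Replace each search term found in a concept class by the whole class.

    `thesaurus` is assumed to be a list of concept classes (lists of terms).
    Terms with no matching class are passed through unchanged.
    """
    expanded = []
    for term in terms:
        # Use the first concept class containing the term, if any.
        entry = next((e for e in thesaurus if term in e), [term])
        for t in entry:
            if t not in expanded:  # avoid repeating a term already present
                expanded.append(t)
    return expanded


thesaurus = [["darkness", "dark", "night", "black"], ["fright", "scare"]]
print(expand_terms(["dark", "storm"], thesaurus))
```

Here 'dark' is replaced by its whole concept class, while 'storm', which appears in no entry, reaches the index unchanged.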

Unfortunately  a  thesaurus  feature  has  several  disadvantages. 
While  increasing  the  recall  of  a  search,  it  is  also  likely  to 
increase  precision  failures  due  to  false  coordinations  and 
incorrect  term  relationships.  Several  features  of  the  thesaurus 
facility  have  been  specifically  designed  to  minimize  the  effect  of 
this problem. In addition, the EUREKA system is designed to be
used on-line, with the user interacting with the system to
dynamically develop his search strategy. In this way the user is
able to monitor the precision of his search by examining the
number of documents retrieved, and improve it by development of a
useful search strategy.

A thesaurus also requires that a considerable vocabulary of
synonyms be constructed which may be difficult to store and
maintain. These synonyms will generally have to be maintained as
the vocabulary of searchers and documents changes with time.
Facilities  for  interactive  maintenance  of  the  thesaurus  are 
provided  to  minimize  this  problem. 

2.3  Synonyms 

Synonyms  are  the  constructs  which  form  the  basis  for  the 
operation  of  the  whole  thesaurus.  Words  which  have  the  same 
meaning  can  be  entered  into  a  concept  class  within  the  thesaurus 
using  the  boolean  logic  which  is  the  basis  of  all  EUREKA  searches. 
For  example  when  we  search  for  the  concept  'darkness',  we  also 
want  any  of  the  terms  'dark'  or  'night'  or  'black'.  Putting  this 
in  a  boolean  expression  as  it  would  appear  in  a  FIND  statement  or 
a thesaurus concept class we get 'darkness'+'dark'+'night'+'black'.
Other facilities available in the thesaurus also depend
on  this  construction  of  a  concept  class  by  including  all  words  or 
word  variations  which  have  the  same  or  similar  meanings. 


2.4 Spelling Variants

The EUREKA system has a universal character #, which allows
any ending to appear after a word stem. As an example of this,
'factor#' will find all terms containing the word stem 'factor'
followed by any ending. It can be used to find all of the terms
'factor', 'factorize' and 'factorization'. Unfortunately the scope
of the universal character is indiscriminate and it will also
detect 'factory' and 'factories', causing a drop in the precision
ratio for the search. A more extreme example is the use of 'd#'
to search for 'die' or 'dying' or 'died'. In cases such as this,
the universal character is obviously useless.
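
The over-matching behaviour of the universal character can be seen in a small sketch. It assumes the universal character is written '#' (as in the THESAURUS command description) and models it as a simple prefix match; the function name `matches` and the sample vocabulary are hypothetical.

```python
def matches(pattern, word):
    """Match a search term against a word.

    A trailing '#' is taken to accept any ending (including none)
    after the word stem; otherwise the match must be exact.
    """
    if pattern.endswith("#"):
        return word.startswith(pattern[:-1])
    return word == pattern


vocabulary = ["factor", "factorize", "factorization", "factory", "factories"]
hits = [w for w in vocabulary if matches("factor#", w)]
print(hits)
```

All five words are selected, including the unwanted 'factory' and 'factories', which is the loss of precision described above.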


The thesaurus allows permanent storage of all variants of a
word as synonyms in a thesaurus entry. The different versions of
the same word are assumed to have the same meaning and would be
stored as 'die'+'dying'+'died'. This approach has the advantage
that we can differentiate between different meanings for the same
word stem with a variety of word endings, depending on the
thesaurus entry they appear in. For example

'factor'+'factorize'+'factorization'
would appear in one entry and

'factory'+'factories'
in another entry, and the two entries are never confused.


In order to conserve storage space, a shorthand method is
available for storing different word endings. The word stem is
ended by a colon and is followed by the allowable endings (which
may include null), separated by commas. Using the examples from
the last paragraph, we would have 'factor:,ize,ization' and
'factor:y,ies'. These word variants can appear in the same
thesaurus entry together with different words with the same
meaning, as in

'factor:y,ies' + 'warehouse#' + 'industry'
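
As an illustration of the shorthand, the following sketch expands a stem-and-endings term into the full words it stands for. It assumes only the notation described above (stem, colon, comma-separated endings, with an empty ending standing for the bare stem); the function name `expand_shorthand` is hypothetical.

```python
def expand_shorthand(term):
    """Expand 'stem:end1,end2' into the list of full word forms.

    An empty ending (as in 'factor:,ize') yields the bare stem.
    Terms without a colon are returned unchanged.
    """
    if ":" not in term:
        return [term]
    stem, endings = term.split(":", 1)
    return [stem + ending.strip() for ending in endings.split(",")]


print(expand_shorthand("factor:,ize,ization"))
print(expand_shorthand("factor:y,ies"))
```

The two entries expand to disjoint word lists, so the two meanings of the stem 'factor' stay separate.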

In addition to this facility for storing word variants in the
thesaurus, an additional feature is provided for terms used in a
search statement which do not appear in any thesaurus entry. The
analysis by Winograd (5) is used to convert plural terms to the
singular equivalent and terms with the special endings
'ly', 'ing', 'er', 'en', 'ed' and 'est' to the singular word stem.
Singular terms are also converted to the plural form. For example
the search expression

'watch' + 'babies' + 'rising'
is expanded to

'watch'+'watches'+'babies'+'baby'+'rising'+'rise'
and 'prettily' is expanded to 'prettily'+'pretty'.
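
The plural handling can be sketched with a few regular English spelling rules. This is a deliberately simplified stand-in and not Winograd's analysis: it covers only regular plurals, it does not reduce the special endings (so 'rising' is left alone rather than expanded to 'rise'), and all function names are hypothetical.

```python
SPECIAL_ENDINGS = ("ly", "ing", "er", "en", "ed", "est")


def to_singular(word):
    """Return the singular of a regular English plural, or None."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith(("ches", "shes", "sses", "xes", "zes")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return None


def to_plural(word):
    """Return the regular English plural of a singular word."""
    if word.endswith(("ch", "sh", "s", "x", "z")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def expand_plurals(terms):
    """Add the singular of each plural term and the plural of each singular."""
    out = []
    for term in terms:
        out.append(term)
        singular = to_singular(term)
        if singular is not None:
            out.append(singular)
        elif not term.endswith(SPECIAL_ENDINGS):
            # Words with a special ending are assumed to have no plural.
            out.append(to_plural(term))
    return out


print(expand_plurals(["watch", "babies", "rising"]))
```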


In a FIND statement it is assumed that words with special
endings are usually verbs, adverbs or adjectives, and do not have
a plural form. This is not true in all cases, but occurs so often
that automatic analysis cannot be done. Similarly terms used
in the singular or plural forms are assumed to be nouns and are
not expanded to include the special endings. It is also assumed
that words which appear in a thesaurus entry will have all
possible endings already associated with them in the thesaurus,
and so no additional analysis is done.

2.5  Different  Grammatical  Constructs 

As a standard feature, the EUREKA system allows the user to
enter phrases as a single term in a search, for example 'birth
control' and 'prevention of pregnancy'. These are also treated by
the thesaurus as single terms and hence can be stored and referred
to in the standard manner. As an example we may have as thesaurus
entries

'birth control'+'contraceptive'+'pregnancy prevention'
and 'factor:y,ies'+'warehouse#'+'industrial plant'

Unfortunately phrases used in this manner must match the
phrases  present  in  the  documents  exactly.  To  be  completely 
effective,  a  thesaurus  entry  must  contain  all  the  possible  phrases 
expressing  each  concept.  This  is  obviously  difficult  to  establish 
and  maintain. 

The EUREKA system contains another feature which overcomes
this last problem, but also reduces the precision ratio. This is
achieved by using statistical association with the boolean AND
function, denoted by *, which assures that two terms appear in the
same context. The context may be a full document, a paragraph or
a sentence. It is to be expected that in some cases the required
terms would appear together, but not related to each other, or
with a meaning different from the one required. These are false
coordinations and incorrect term relationships and increase the
precision failure in a search. For example
'pregnancy'*'prevention' would incorrectly retrieve 'prevention of
hysterical fathers during pregnancy'.
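
The effect of the context on the AND function can be sketched as follows. This is an illustrative assumption, not the EUREKA algorithm: terms are matched as lower-case substrings, paragraphs are taken to be separated by blank lines, and sentences are split at '.', '!' or '?'. The function name `cooccur` is hypothetical.

```python
import re


def cooccur(text, a, b, context="document"):
    """Return True if terms a and b appear together in one context unit."""
    if context == "document":
        units = [text]
    elif context == "paragraph":
        units = text.split("\n\n")
    else:  # "sentence"
        units = re.split(r"(?<=[.!?])\s+", text)
    return any(a in u.lower() and b in u.lower() for u in units)


false_hit = "Prevention of hysterical fathers during pregnancy is discussed."
unrelated = "Pregnancy is common. Prevention of fires is hard."

print(cooccur(false_hit, "pregnancy", "prevention", "sentence"))
print(cooccur(unrelated, "pregnancy", "prevention", "document"))
print(cooccur(unrelated, "pregnancy", "prevention", "sentence"))
```

A narrower context removes the coincidental match in the second text, but the first text still matches within a single sentence, so a false coordination of that kind cannot be excluded by context alone.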

When  required,  these  term  relationships  can  be  stored  in  the 
thesaurus  in  the  form  usually  used  in  searches,  surrounded  by 
parentheses,  as  in  ('pregnancy'  *  'prevention').  Such 
relationships  will  not  be  matched  to  any  terms  in  a  FIND 
statement.  Thus  to  retrieve  such  an  expression,  a  word  or  phrase 
must appear in the thesaurus entry and must also be used in the
search statement. For example

'birth control'+('pregnancy'*'prevention')
can only be used if 'birth control' appears in the search
statement.

To  be  most  effective,  these  term  relationships  should  be  used 
in  a  restricted  context  such  as  sentence  or  paragraph. 
Unfortunately  the  thesaurus  is  incapable  of  forcing  such  a  context 
and  uses  full  documents  as  the  context  unless  the  user  explicitly 
specifies  otherwise. 

Another method of handling different grammatical constructs
is to do a full syntactical analysis of the document text to
discover all syntactic equivalents of the given phrase. This
approach was considered far too complex and slow to be an
effective tool in the on-line environment of EUREKA.

2.6  Words  with  Multiple  Meanings 

In  any  free  text  information  retrieval  system,  multiple 
meanings are a problem when searches are being conducted. They
lead  to  a  decrease  in  precision  due  to  false  coordinations  and 
incorrect  term  relationships.  In  the  thesaurus  facility  in  the 
EUREKA  system,  a  word  with  multiple  meanings  may  appear  in  more 
than  one  thesaurus  entry  and  will  be  flagged  as  having  multiple 
appearances.  When  a  search  statement  containing  the  term  is 
entered,  the  system  displays  each  thesaurus  entry  and  asks  the 
user  if  it  has  the  correct  meaning. 

If  a  term  has  multiple  meanings,  but  only  appears  in  the 
thesaurus  once,  the  system  is  unaware  of  the  alternative  meanings 
and  will  automatically  use  the  single  entry  whenever  the  term  is 
used. 
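
The handling of multiple meanings can be sketched as below, under the same assumed representation of a thesaurus as a list of concept classes. The `ask` callback stands in for the interactive question put to the user; it and the function name `choose_entry` are hypothetical.

```python
def choose_entry(term, thesaurus, ask):
    """Pick the concept class to substitute for a search term.

    A term found in one class is substituted automatically. A term found
    in several classes is resolved by asking the user about each in turn.
    """
    found = [entry for entry in thesaurus if term in entry]
    if len(found) <= 1:
        return found[0] if found else None
    for entry in found:
        if ask(entry):  # user confirms this entry has the intended meaning
            return entry
    return None


thesaurus = [["plant", "factory"], ["plant", "shrub"]]
print(choose_entry("plant", thesaurus, lambda entry: "shrub" in entry))
print(choose_entry("factory", thesaurus, lambda entry: False))
```

The second call shows the case described above: 'factory' appears only once, so its entry is used without consulting the user.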


When a FIND statement includes a term with the universal
character #, the system assumes that it may match more than one
thesaurus entry and so displays each one that is matched and asks
if it is the correct one. Only one such entry will replace the
input term.

2.7 Different User Environments

Each user of an information retrieval system will operate to
some extent in his own environment. His information requirements,
vocabulary and expectations from the system will be different from
other users, even if they are working in a similar field. For
this reason, a Universal Thesaurus is available to all users, and
each individual user is given the full thesaurus features
available in his own User Thesaurus. He alone is completely
responsible for the maintenance and use of this thesaurus, and may
store in it whatever he chooses. The user may select between the
use of his own User Thesaurus, use of the Universal Thesaurus, or
use of both, thus giving him considerable flexibility.

The user may use only the Universal Thesaurus for general
queries and then select his User Thesaurus for searches in
a particular field. This feature is described in more detail in
Section 3.4. It gives the user the ability to dynamically change
his search environment. In this way, each user can use thesaurus
entries tailored to his own individual needs without interfering
at all with other users.



2.8 User Control of Thesaurus Features

When  a  feature  such  as  a  thesaurus  automatically  alters  the 
search  statements  entered  by  the  user,  the  user  must  have  the 
ability  to  control  its  use.  This  is  accomplished  in  the  case  of 
the  thesaurus  by  allowing  the  user  to  selectively  turn  automatic 
features  on  and  off.  These  features  include  use  of  the  whole 
thesaurus,  selective  use  of  the  Universal  Thesaurus  and  the  User 
Thesaurus,  display  of  the  expanded  form  of  the  search  statement, 
use  of  the  special  word  endings  feature  and  use  of  the  plural 
words feature. These can all be controlled for each individual
statement, giving the user great flexibility. The user is able to
interactively decide if the thesaurus is introducing incorrect
terms into a particular search statement and can improve the
precision of his search by turning thesaurus features off for that
search.

2.9  Construction  of  the  Thesaurus 


The  construction  of  the  thesaurus  is  almost  completely  a 
manual task. The user must think of synonyms, phrases and term
relationships to be entered into the thesaurus using the ENTER
command. Some assistance is given to the user for generating
different endings to each word entered, based again on the work of
Winograd (5). When the singular form of a term is entered, the
plural form is automatically derived; when the plural or a form
with one of the special endings 'ly', 'ing', 'er', 'en', 'ed' and
'est' is entered, the singular form is derived from it. Each of
the special endings mentioned above is then added to the singular
form and the user is interactively asked to determine whether the
resultant word has the correct meaning. This allows the automatic
generation of a wide range of commonly used endings and at the
same time removes erroneous versions of words. For example, if
the term 'fast' was entered, the user may accept 'faster' and
'fastest' but reject the incorrect meanings 'fasting', 'fasted'
and 'fasten' and the nonsense word 'fastly'.
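
The interactive generation of endings can be sketched as follows. The sketch simply appends each special ending to the singular form and keeps those the user confirms, storing the result in the stem-and-endings shorthand. The `confirm` callback stands in for the user's Y or N answer, and the plural form is left out for brevity; both simplifications, and the function name `enter_term`, are assumptions of this example.

```python
SPECIAL_ENDINGS = ["ly", "ing", "er", "en", "ed", "est"]


def enter_term(singular, confirm):
    """Build a shorthand thesaurus term from user-confirmed endings.

    Each candidate word (singular + ending) is offered to `confirm`;
    accepted endings are stored after the stem, with an empty ending
    first to stand for the singular form itself.
    """
    accepted = [e for e in SPECIAL_ENDINGS if confirm(singular + e)]
    return singular + ":" + ",".join([""] + accepted)


# The user accepts 'faster' and 'fastest' and rejects the other candidates.
print(enter_term("fast", lambda word: word in {"faster", "fastest"}))
```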

3. THESAURUS COMMAND

The thesaurus is implicitly referenced by every FIND
statement as described in the preceding sections. All statements
which explicitly reference the thesaurus are grouped together as
the THESAURUS command. The keyword THESAURUS in this statement
must be immediately followed by a second keyword to identify the
particular type of thesaurus facilities required. In most cases
additional information is also required in the command. The forms
of all variations of the THESAURUS command are shown in Table
3.1.1. Keywords in these statements are shown in upper case
letters, and may be abbreviated to any number of characters which
uniquely identify them.


THESAURUS ENTER <Expression>

THESAURUS DISPLAY <Expression>

THESAURUS DISPLAY ALL

THESAURUS DELETE <Term>

THESAURUS [ON ! OFF] [ALL ! USER ! UNIV ! EXPANSION ! WORDENDING ! PLURAL]

<Expression> ::= <Term> ! <Term> + <Expression>

Table 3.1.1
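
The abbreviation rule for keywords can be sketched as a unique-prefix lookup. The keyword list below is taken from the second-level keywords of Table 3.1.1; the function name `resolve_keyword` and the choice to raise an error on an ambiguous or unknown abbreviation are assumptions of this example.

```python
KEYWORDS = ["ENTER", "DISPLAY", "DELETE", "ON", "OFF"]


def resolve_keyword(abbrev, keywords=KEYWORDS):
    """Return the single keyword that begins with the given abbreviation."""
    candidates = [k for k in keywords if k.startswith(abbrev.upper())]
    if len(candidates) != 1:
        raise ValueError(f"{abbrev!r} does not uniquely identify a keyword")
    return candidates[0]


print(resolve_keyword("E"))    # ENTER
print(resolve_keyword("DIS"))  # DISPLAY
print(resolve_keyword("DEL"))  # DELETE
```

Under this rule 'D' alone is rejected, since it could mean either DISPLAY or DELETE, which is consistent with the examples in this chapter using DIS and DEL.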

As described in previous sections, <Term> may be a word, a
phrase, a word containing the universal character #, or a word
stem followed by : and one or more word endings separated by
commas. Each of these forms must be enclosed in quotes as in
other EUREKA commands. The <Term> may also be a term relationship
of two or more terms separated by * and enclosed by parentheses.

The following sections describe the different options
available in the THESAURUS command. Examples are included to
illustrate  the  use  of  the  facilities. 

3.1 ENTER Option

This  command  is  used  to  enter  terms  with  the  same  meaning 
into the thesaurus. Each concept class in the thesaurus is
searched for the occurrence of any of the terms in the search
expression. Each of the terms which do not occur already, and
which do not have multiple endings or a # in them, will be given
the word ending treatment. If none of the terms occur already, a
new entry is created. For example, assuming the User Thesaurus is
initially empty, we would get the following sequence for the
command 

T E 'CALL'
DO FOLLOWING WORDS HAVE THE SAME MEANING (Y OR N)
'CALLY'
N
'CALLING'
Y
'CALLER'
Y
'CALLEN'
N
'CALLED'
Y
'CALLEST'
N
'CALL:,S,ING,ED,ER'
WILL ONLY ENTER INTO USER THESAURUS

This will create a new concept class as shown in the second last
line of the example. Similarly the commands

T E 'FACTOR:Y,IES' + 'WAREHOUSE#' + 'INDUSTRIAL PLANT'
T E 'BIRTH CONTROL' + ('PREGNANCY'*'PREVENTION')

will create the concept classes

'FACTOR:Y,IES' + 'WAREHOUSE#' + 'INDUSTRIAL PLANT:,S'
'BIRTH CONTROL:,S' + ('PREGNANCY'*'PREVENTION')

When a concept class is found which contains one of the terms
in the command, it will be displayed and the user asked if it has
the correct meaning. If it does, the thesaurus concept class will
replace the term in the ENTER command, and will then be deleted.
The search is then continued to find any other concept classes
containing any of the other terms in the original command. When
the search is completed, the expanded expression is entered into
the thesaurus as a new concept class. As an example, the command

T E 'SHOUTING' + 'CALL'
will match the concept class entered in the first example. The
term 'CALL' will be replaced by that concept class, and the new
concept class created will be

'SHOUT:ING,,S,ED,ER' + 'CALL:,S,ING,ED,ER'

In this way, all the terms in the ENTER command which exist
in  a  thesaurus  concept  class  are  replaced  by  the  appropriate 
concept  class.  As  a  result,  if  two  different  terms  in  the  ENTER 
statement  already  appear  in  different  thesaurus  concept  classes, 
these  classes  will  be  combined  into  a  single  large  entry.  Since 
this  process  is  repeated  for  all  terms  entered  into  the  thesaurus, 
the  same  term  should  not  appear  in  two  different  concept  classes 
with  the  same  meaning. 
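
The merging behaviour of ENTER can be sketched as follows, again assuming a thesaurus held as a list of concept classes (lists of terms). The sketch omits the user confirmation, the word ending treatment and the multiple meaning case; the function name `enter` is hypothetical.

```python
def enter(thesaurus, new_terms):
    """Enter terms, merging every existing class that shares a term.

    Returns a new thesaurus in which each matched class has been removed
    and its terms folded into one combined class with the new terms.
    """
    merged = list(new_terms)
    remaining = []
    for entry in thesaurus:
        if any(term in entry for term in new_terms):
            # Matched class: absorb its terms; the old class is dropped.
            merged.extend(t for t in entry if t not in merged)
        else:
            remaining.append(entry)
    remaining.append(merged)
    return remaining


thesaurus = [["call", "caller"], ["shout", "yell"], ["dark", "night"]]
print(enter(thesaurus, ["shout", "call"]))
```

The two classes containing 'call' and 'shout' are combined into one entry, while the unrelated class is left untouched.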

If a term exists already in a concept class but has a different
meaning from the one being entered, it is flagged as a multiple
meaning and a new entry is created containing the new terms.

3.2  DISPLAY  Option 

The  DISPLAY  command  is  used  to  display  all  thesaurus  entries 
containing  any  of  the  terms  or  phrases  specified.  The  terms  may 
be any of the forms described for the ENTER command as in the
following examples.

T DIS 'WAREHOUSE#'
USER THESAURUS
'FACTOR:Y,IES' + 'WAREHOUSE#' + 'INDUSTRIAL PLANT:,S'

The DISPLAY ALL command is used to display all entries in the
thesaurus, as shown below.

THES DIS ALL
USER THESAURUS
'FACTOR:Y,IES' + 'WAREHOUSE#' + 'INDUSTRIAL PLANT:,S'
'BIRTH CONTROL:,S' + ('PREGNANCY'*'PREVENTION')
'SHOUT:ING,,S,ED,ER' + 'CALL:,S,ING,ED,ER'

3.3 DELETE Option

If a thesaurus entry is in error it may be deleted by
specifying in a DELETE statement any term which occurs within it.
Each concept class which contains this term will be displayed and
the user asked if he wants it deleted. This provides the user
with an opportunity to reconsider, and also allows duplicates to
be deleted individually. Using the previous examples, the user
may decide that 'shout' and 'call' are not really synonyms, and so
would enter the following command.

T DEL 'CALL'
'SHOUT:ING,,S,ED,ER' + 'CALL:,S,ING,ED,ER'
DO YOU REALLY WANT TO DELETE THIS SYNONYM (Y OR N)
Y

3.4 ON and OFF Options

These  are  to  allow  the  user  to  control  some  of  the  facilities 
of  the  thesaurus  as  described  earlier.  The  keyword  ALL  turns  the 
whole thesaurus feature on and off by simultaneously turning both
the Universal Thesaurus and the User Thesaurus on or off. The
keywords UNIVERSAL and USER are used to turn on and off the
Universal Thesaurus and the User Thesaurus respectively. When
both of these are turned off, the thesaurus feature is not used at
all. If one thesaurus is turned on, FIND commands and THESAURUS
commands operate on it only. When both are turned on, the ENTER
and DELETE options of the THESAURUS command will only operate on
the User Thesaurus, the DISPLAY option operates on both, and FIND
commands  will  search  first  the  User  Thesaurus  and  then  the 
Universal  Thesaurus.  Only  the  first  match  found  will  be  used,  so 
that  if  the  same  word  appears  in  one  concept  in  the  Universal 
Thesaurus  and  one  concept  in  the  User  Thesaurus,  only  the  User 
Thesaurus  concept  will  be  used. 
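
The search order used by FIND commands can be sketched as below. Each thesaurus is assumed to be a list of concept classes and the ON and OFF settings are modelled as two boolean flags; the function name `lookup` and its parameters are hypothetical.

```python
def lookup(term, user, universal, user_on=True, univ_on=True):
    """Return the first concept class containing the term.

    The User Thesaurus is searched before the Universal Thesaurus,
    and a thesaurus that is turned off is skipped entirely.
    """
    for enabled, thesaurus in ((user_on, user), (univ_on, universal)):
        if not enabled:
            continue
        for entry in thesaurus:
            if term in entry:
                return entry  # only the first match is used
    return None


user = [["plant", "factory"]]
universal = [["plant", "shrub", "tree"]]
print(lookup("plant", user, universal))
print(lookup("plant", user, universal, user_on=False))
```

With both thesauri on, the user's own entry for 'plant' takes priority; turning the User Thesaurus off exposes the Universal Thesaurus entry instead.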

The keyword EXPANSION is used to control the display of the
expanded form of each FIND command. Automatic analysis of special
endings is turned on or off by using WORDENDING, and similarly
PLURALS controls generation of plural forms and conversion of
plurals to the singular form. Examples are

T OFF ALL
THES OFF PLURALS
T ON USER

These commands will turn off the whole thesaurus and the plurals
processing, and then turn the User Thesaurus back on again.

4. IMPLEMENTATION

4.1  Overall  View 

All  of  the  thesaurus  routines  are  executed  under  the  control 
of  the  thesaurus  search  module  THESCH.  In  the  case  of  the  FIND 
command,  the  thesaurus  routines  are  called  by  the  routine  PARSER 
for  each  term  in  the  search  expression  (although  the  thesaurus  can 
handle  the  whole  search  expression  after  the  expansion  of  macro 
calls).  Each  THESAURUS  command  is  handled  by  one  call  to  THESCH 
from  PARSUB. 


Bytes   Contents

0-1     Address of the user Logon Block.
2-3     Command code:  21 = THESAURUS ENTER
                       22 = FIND
                       23 = THESAURUS DISPLAY
                       24 = THESAURUS DELETE
                       25 = THESAURUS DISPLAY ALL
4-5     Address of input expression.
6-7     Address for output expression (if any required).

Thesaurus Command Table Structure
Table 4.1.1

The routine THESCH sets up a thesaurus command table, shown
in Table 4.1.1, which is passed to the routine THSPCH for
execution.

For  each  thesaurus  file  currently  turned  on,  each  concept 
class  is  searched  for  any  terms  used  in  the  command.   If  any  are 
found,  the  appropriate  action  is  taken  for  the  particular  command 
and the search goes on until the end of file is reached, or all
terms  in  the  command  have  been  found. 

The routines used in this process are as follows

THSETA - To set up addresses of terms in the command search
expression.

THSCHS - To search the current concept class for any terms which
match the command search expression.

THESIO - Perform standard disk and terminal I/O.

THMSYN - Replaces a term in the search expression with the current
concept class (assumed to match it).

THCRSN - Creates a new concept class for the expanded search
expression.

THWEND - Does word analysis to reduce special endings and plurals
to the singular form, and convert singular words to the
plural form.

THENTW - Adds the special endings to the singular form of the word
for the ENTER command only.

4.2  Structure  of  Thesaurus  Files 

The User Thesaurus is stored with other user specific
information  such  as  macros,  query  history  and  comments,  in  the 
User  File.  The  thesaurus  starts  at  block  240  and  is  only  limited 
in  length  by  the  size  of  the  User  File.  The  Universal  Thesaurus 
starts  in  block  number  1  of  the  file  UNIVTH.SYN.  In  both  of  these 
files,  each  block  is  256  words,  or  512  bytes  in  length.  The  first 
block  in  each  file  has  the  structure  shown  in  Table  4.2.1  and 
other  blocks  have  the  structure  shown  in  Table  4.2.2.  The  only 
difference  between  these  two  structures  is  that  bytes  2-3  of  the 
first  block  contain  the  number  of  the  last  block  of  the  file  which 
is  being  used. 


Bytes     Contents

0-1       Number of the last byte used in this block.

2-3       Number of the last block used in the file.

4-511     Thesaurus concept classes.

Structure of First Block of Thesaurus Files
Table 4.2.1

Bytes     Contents

0-1       Number of the last byte used in this block.

2-511     Thesaurus concept classes.

Structure of Other Blocks in Thesaurus Files
Table 4.2.2
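
As an illustration of the two block layouts, the following Python sketch splits a 512-byte block into its header fields and the bytes holding concept classes. It assumes 16-bit fields in PDP-11 (little-endian) byte order and assumes that the "last byte used" value is an inclusive offset from the start of the block. Neither point is stated in the tables, and the function name is hypothetical.

```python
import struct

BLOCK_SIZE = 512  # 256 sixteen-bit words


def parse_block(block, is_first):
    """Split one thesaurus block into header fields and concept-class bytes.

    Returns (last_byte_used, last_block_used, classes); last_block_used is
    None for every block except the first block of the file.
    """
    if len(block) != BLOCK_SIZE:
        raise ValueError("a thesaurus block must be exactly 512 bytes")
    (last_byte_used,) = struct.unpack_from("<H", block, 0)
    last_block_used = None
    data_start = 2
    if is_first:
        # Only the first block carries the file-wide "last block used" word.
        (last_block_used,) = struct.unpack_from("<H", block, 2)
        data_start = 4
    # Assumption: "last byte used" is an inclusive offset within the block.
    classes = block[data_start:last_byte_used + 1]
    return last_byte_used, last_block_used, classes
```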


Bytes     Contents

0-1       Length of this concept class = L
          (includes bytes 0-7).

2-5       Bits indicating if the corresponding terms in
          the concept class have a duplicate meaning.

6         Number of terms in this concept class.

7         Not used.

8-L       Terms in the concept class.

Concept Class Structure
Table 4.2.3


Table 4.2.3 shows the structure of each concept class
stored in the thesaurus files. Each concept class must reside in
a single block, and hence L is limited to 512 bytes. The bits
indicating whether the corresponding terms have multiple meanings
are normally set to 0, and are set to 1 if the term has a multiple meaning.
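
The concept class header can be decoded along the following lines. This Python sketch rests on several assumptions that the table does not settle: fields are little-endian, bytes 2-5 are read as one 32-bit value whose bit i (least significant first) belongs to the i-th term, L is the total length so that the terms occupy bytes 8 to L-1, and concept classes are stored back to back within a block. The encoding of the individual terms is not described, so they are returned as raw bytes. All names are hypothetical.

```python
import struct


def parse_concept_class(buf, offset=0):
    """Decode the 8-byte header of one concept class starting at offset."""
    length, flags, count = struct.unpack_from("<HIB", buf, offset)
    if length < 8:
        raise ValueError("concept class length must cover the 8-byte header")
    # Assumption: terms occupy bytes 8 .. L-1 of the concept class.
    terms = bytes(buf[offset + 8:offset + length])
    # Assumption: bit i (least significant first) flags the i-th term.
    multiple = [bool((flags >> i) & 1) for i in range(count)]
    return {
        "length": length,
        "count": count,
        "multiple_meaning": multiple,
        "terms": terms,
    }


def iter_concept_classes(area):
    """Walk back-to-back concept classes using each one's length field."""
    offset = 0
    while offset + 8 <= len(area):
        concept_class = parse_concept_class(area, offset)
        yield concept_class
        offset += concept_class["length"]
```

With four bytes of flags, at most 32 terms in a concept class can carry a multiple-meaning bit under this reading.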

5.  CONCLUSION

The preceding sections have described a thesaurus facility
within the EUREKA system. It addresses three major causes of
recall failure due to language problems and allows parts of search
statements in the form of synonyms to be stored for repeated use.
Commands are provided for the user to make entries to the
thesaurus or delete these entries. Users may also display the
contents of the thesaurus and control the automatic facilities
which it provides.

The construction of such synonym tables requires a
considerable expenditure of human intellectual effort.
Nevertheless, such searching aids will hopefully raise the average
performance capabilities of the free text information retrieval
system dramatically. These synonym tables could repay their cost
manyfold in saving the time and intellectual effort of users, thus
leading to overall economy in the system.

APPENDIX A
Error Messages

NEW  SYNONYM  TOO  LONG 

Cause  -  After  combination  with  concept  classes  with  the  same 
meaning,  already  in  the  thesaurus,  the  expression  being 
entered  is  too  long  to  be  stored  in  the  file.  It  is  over  502 
characters  in  length. 

SYNONYM  IS  SMALLER  THAN  TERM  SO  NOT  USED 

Cause - A term in the last command matches a synonym which is
shorter than the actual term, and therefore likely to be more
restrictive.

THESAURUS  FILE  ERROR 

Cause  -  The  thesaurus  file  currently  in  use  has  been 
corrupted.   Cry  for  help. 

THESAURUS  FILE  IS  FULL 

Cause  -  There  is  no  room  to  store  any  more  concept  classes  in 
the  thesaurus  currently  turned  on.  Call  a  programmer  to 
reallocate  the  thesaurus  in  a  larger  contiguous  file. 

THESAURUS  -  ILLEGAL  CHARACTER  FOUND 

Cause  -  The  search  expression  or  term  in  the  last  command  was 
not  in  the  correct  form  or  contained  an  illegal  character. 
Re-enter  the  command  correctly. 

THESAURUS - NO TERMS IN FIND

Cause - The last command entered had no terms in the
expression part. Enter a sensible command.

UNIVERSAL THESAURUS ALREADY IN USE

Cause  -  Only  one  person  can  enter  information  into  the 
Universal  Thesaurus  at  one  time.  Wait  until  the  current  user 
has  turned  the  Universal  Thesaurus  off. 


WILL ONLY ENTER INTO USERS THESAURUS

Cause - Both the Users Thesaurus and the Universal Thesaurus
were turned on when an ENTER command was performed. The
expression entered will only be stored in the Users
Thesaurus.